{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Generation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.\n",
    "\n",
    "Markov chains don't generate text as well as deep learning, but it's a good (and fun!) start."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select Text to Imitate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read in the corpus, including punctuation!\n",
    "import pandas as pd\n",
    "\n",
    "data = pd.read_pickle('corpus.pkl')\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract only Ali Wong's text\n",
    "ali_text = data.transcript.loc['ali']\n",
    "ali_text[:200]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a Markov Chain Function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to build a simple Markov chain function that creates a dictionary:\n",
    "* The keys should be all of the words in the corpus\n",
    "* The values should be a list of the words that follow the keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "def markov_chain(text):\n",
    "    '''The input is a string of text and the output will be a dictionary with each word as\n",
    "       a key and each value as the list of words that come after the key in the text.'''\n",
    "    \n",
    "    # Tokenize the text by word, though including punctuation\n",
    "    words = text.split(' ')\n",
    "    \n",
    "    # Initialize a default dictionary to hold all of the words and next words\n",
    "    m_dict = defaultdict(list)\n",
    "    \n",
    "    # Create a zipped list of all of the word pairs and put them in word: list of next words format\n",
    "    for current_word, next_word in zip(words[0:-1], words[1:]):\n",
    "        m_dict[current_word].append(next_word)\n",
    "\n",
    "    # Convert the default dict back into a dictionary\n",
    "    m_dict = dict(m_dict)\n",
    "    return m_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the dictionary for Ali's routine, take a look at it\n",
    "ali_dict = markov_chain(ali_text)\n",
    "ali_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a Text Generator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to create a function that generates sentences. It will take two things as inputs:\n",
    "* The dictionary you just created\n",
    "* The number of words you want generated\n",
    "\n",
    "Here are some examples of generated sentences:\n",
    "\n",
    ">'Shape right turn– I also takes so that she’s got women all know that snail-trail.'\n",
    "\n",
    ">'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def generate_sentence(chain, count=15):\n",
    "    '''Input a dictionary in the format of key = current word, value = list of next words\n",
    "       along with the number of words you would like to see in your generated sentence.'''\n",
    "\n",
    "    # Capitalize the first word\n",
    "    word1 = random.choice(list(chain.keys()))\n",
    "    sentence = word1.capitalize()\n",
    "\n",
    "    # Generate the second word from the value list. Set the new word as the first word. Repeat.\n",
    "    for i in range(count-1):\n",
    "        word2 = random.choice(chain[word1])\n",
    "        word1 = word2\n",
    "        sentence += ' ' + word2\n",
    "\n",
    "    # End it with a period\n",
    "    sentence += '.'\n",
    "    return(sentence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_sentence(ali_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Additional Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
